OpenShift Deployment with GPU Support #376
✅ Deploy Preview for vllm-semantic-router ready!
Force-pushed from 22073a8 to 27750ad
This commit adds comprehensive OpenShift deployment support with GPU-enabled specialist model containers, providing a complete automation solution for deploying the semantic router to OpenShift clusters.

**Core Deployment:**
- deployment.yaml: Kubernetes deployment manifest with GPU support
  * 4-container pod: semantic-router, model-a, model-b, envoy-proxy
  * CDI annotations for GPU device injection (gpu=0, gpu=1)
  * GPU node selection and tolerations
  * PVC mounts for models and cache
  * Production log levels (INFO for containers, info for Envoy)
- deploy-to-openshift.sh: Main deployment automation script (826 lines)
  * Auto-detection of OpenShift server and existing login
  * Enhanced deployment method with llm-katan specialists
  * Alternative methods: kustomize, template
  * Configurable resources, storage, logging
  * Automatic namespace creation
  * Inline Dockerfile build for llm-katan image
  * Service and route creation
  * Optional port forwarding (disabled by default)
  * Displays OpenWebUI endpoint at completion
- cleanup-openshift.sh: Cleanup automation script (494 lines)
  * Auto-detection of cluster and namespace
  * Graceful cleanup with confirmation
  * Port forwarding cleanup
  * Comprehensive resource deletion

**Configuration:**
- config-openshift.yaml: Semantic router config for OpenShift
  * Math-specialist and coding-specialist endpoints
  * Category-to-specialist routing
  * PII and jailbreak detection configuration
- envoy-openshift.yaml: Envoy proxy configuration
  * HTTP listener on port 8801
  * External processing filter
  * Specialist model routing
  * /v1/models aggregation

**Container Image:**
- Dockerfile.llm-katan: GPU-enabled specialist container image
  * Python 3.10-slim base
  * PyTorch with CUDA 12.1 support
  * llm-katan, transformers, accelerate packages
  * HuggingFace caching configuration
  * Health check endpoint

**Alternative Deployment Methods:**
- kustomization.yaml: Kustomize deployment option
- template.yaml: OpenShift template with parameters

**Documentation & Validation:**
- README.md: Comprehensive deployment documentation
- validate-deployment.sh: 12-test validation script
  * Namespace, deployment, container readiness
  * GPU detection in both specialist containers
  * Model loading verification
  * PVC, service, route checks
  * GPU node scheduling confirmation
- Makefile: Add include for tools/make/openshift.mk
- tools/make/openshift.mk: Optional make targets for OpenShift operations
  * openshift-deploy, openshift-cleanup, openshift-status
  * openshift-logs, openshift-routes, openshift-test
  * Port forwarding helpers

1. **GPU Support**: Full NVIDIA GPU support via CDI device injection
2. **Specialist Models**: Real llm-katan containers for math/coding tasks
3. **Zero-Touch Deployment**: Auto-detection of cluster, automatic builds
4. **Production Ready**: Production log levels, proper health checks
5. **Validation**: Comprehensive 12-test validation suite
6. **UX Enhancements**: OpenWebUI endpoint display, optional port forwarding
7. **Clean Separation**: Only touches deploy/openshift/ (plus minimal Makefile)

```
Pod: semantic-router
├── semantic-router (main ExtProc service, port 50051)
├── model-a (llm-katan math specialist, port 8000, GPU 0)
├── model-b (llm-katan coding specialist, port 8001, GPU 1)
└── envoy-proxy (gateway, port 8801)
```

Validated on OpenShift with NVIDIA L4 GPUs:
- All 4 containers running
- GPUs detected in both specialist containers
- Models loaded on CUDA
- PVCs bound
- Services and routes accessible
- Streaming functionality working

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
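For reference, a minimal usage sketch of the make targets listed above; the per-target behavior noted in the comments is inferred from the target names and this description, not verified against openshift.mk:

```bash
# Hedged sketch: invoking the optional OpenShift make targets from the repo root.
make openshift-deploy    # full deployment (wraps deploy-to-openshift.sh)
make openshift-status    # show deployment/pod status
make openshift-logs      # tail container logs
make openshift-routes    # list exposed routes
make openshift-test      # basic endpoint smoke tests
make openshift-cleanup   # tear down (wraps cleanup-openshift.sh)
```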
Routes are created without TLS termination by default, so URLs should use http:// not https://. This fixes the quick test commands shown at deployment completion.

Tested and verified:
- curl http://semantic-router-api.../health works
- curl -X POST http://semantic-router-api.../api/v1/classify/intent works

Signed-off-by: Yossi Ovadia <[email protected]>
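A hedged way to run the same verification without hardcoding the route hostname; the route name `semantic-router-api` matches the host prefix above, while the namespace and the classify request body are assumptions:

```bash
# Assumptions: route named "semantic-router-api", namespace as used elsewhere in this PR,
# and a simple {"text": ...} body for the classify endpoint.
NAMESPACE=vllm-semantic-router-system
API_HOST=$(oc get route semantic-router-api -n "$NAMESPACE" -o jsonpath='{.spec.host}')

curl "http://${API_HOST}/health"
curl -X POST "http://${API_HOST}/api/v1/classify/intent" \
  -H "Content-Type: application/json" \
  -d '{"text": "solve x^2 - 4 = 0"}'
```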
Force-pushed from 27750ad to f938960
👥 vLLM Semantic Team Notification: The following members have been identified for the changed files in this PR and have been automatically assigned.
@@ -0,0 +1,45 @@
# Optimized Dockerfile for llm-katan - OpenShift compatible
do you need this here? there is already a copy here https://github.com/vllm-project/semantic-router/blob/main/e2e-tests/llm-katan/Dockerfile
Pull Request Overview
This PR adds comprehensive OpenShift deployment infrastructure for the semantic router with NVIDIA GPU support and specialized LLM containers. It provides zero-touch deployment automation with validation scripts and supports a 4-container pod architecture including semantic-router service, two GPU-enabled model specialists (model-a and model-b), and an envoy-proxy for HTTP gateway functionality.
- Complete OpenShift deployment manifests with GPU scheduling and security contexts
- Automated deployment and validation scripts with error handling and port forwarding
- OpenShift-specific configurations for Envoy proxy and model routing
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
Summary per file:
File | Description |
---|---|
tools/make/openshift.mk | Makefile targets for OpenShift operations including login, deploy, status, and debugging |
deploy/openshift/validate-deployment.sh | Comprehensive validation script testing all 4 containers, GPU detection, and model loading |
deploy/openshift/template.yaml | OpenShift template for parameterized deployment with security contexts |
deploy/openshift/kustomization.yaml | Kustomize configuration for OpenShift deployment with labels and annotations |
deploy/openshift/envoy-openshift.yaml | OpenShift-specific Envoy configuration using static clusters for pod networking |
deploy/openshift/deployment.yaml | Main deployment manifest with 4-container pod, GPU scheduling, and init container |
deploy/openshift/deploy-to-openshift.sh | Automated deployment script with login detection, build management, and port forwarding |
deploy/openshift/config-openshift.yaml | OpenShift-specific router configuration with localhost endpoints and model policies |
deploy/openshift/cleanup-openshift.sh | Comprehensive cleanup script with multiple cleanup levels and safety confirmations |
deploy/openshift/README.md | Documentation for OpenShift deployment with troubleshooting and monitoring guides |
deploy/openshift/Dockerfile.llm-katan | Dockerfile for building llm-katan specialist containers with CUDA support |
Makefile | Include openshift.mk in the main Makefile |
Comments suppressed due to low confidence (2)
deploy/openshift/deployment.yaml:1
- Hardcoded storage class 'gp3-csi' may not be available on all OpenShift clusters. Consider making this configurable or using the cluster's default storage class.
apiVersion: apps/v1
deploy/openshift/deployment.yaml:1
- Hardcoded storage class 'gp3-csi' may not be available on all OpenShift clusters. Consider making this configurable or using the cluster's default storage class.
apiVersion: apps/v1
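Both suppressed comments flag the hardcoded `gp3-csi` storage class. A hedged way to check the cluster's default class before deploying; the STORAGE_CLASS override shown in the comment is a hypothetical parameter, not necessarily supported by the script:

```bash
# The default StorageClass is marked "(default)" in the output.
oc get storageclass

# Hypothetical override when running the deploy script:
# STORAGE_CLASS=<default-class-name> ./deploy-to-openshift.sh
```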
echo "$(RED)[ERROR]$(NC) OPENSHIFT_PASSWORD is required"; \ | ||
exit 1; \ | ||
fi | ||
@oc login -u $(OPENSHIFT_USER) -p $(OPENSHIFT_PASSWORD) $(OPENSHIFT_SERVER) --insecure-skip-tls-verify |
Copilot AI (Oct 8, 2025):

Using --insecure-skip-tls-verify bypasses SSL certificate validation, which poses a security risk. Consider making this configurable or documenting the security implications.
fi

log "INFO" "Logging into OpenShift at $OPENSHIFT_SERVER as $OPENSHIFT_USER"
if ! oc login -u "$OPENSHIFT_USER" -p "$OPENSHIFT_PASSWORD" "$OPENSHIFT_SERVER" --insecure-skip-tls-verify; then
Copilot AI (Oct 8, 2025):

Using --insecure-skip-tls-verify bypasses SSL certificate validation. This should be configurable or at least documented as a security consideration for production deployments.
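One hedged way to address this in the deploy script is to gate the flag behind an opt-in variable; the variable name below is an assumption, not part of the PR:

```bash
# Sketch only: OPENSHIFT_SKIP_TLS_VERIFY is a hypothetical opt-in variable.
OC_LOGIN_ARGS=""
if [[ "${OPENSHIFT_SKIP_TLS_VERIFY:-false}" == "true" ]]; then
    echo "WARNING: TLS certificate verification is disabled for oc login" >&2
    OC_LOGIN_ARGS="--insecure-skip-tls-verify"
fi
# Intentionally unquoted so an empty value expands to nothing.
if ! oc login -u "$OPENSHIFT_USER" -p "$OPENSHIFT_PASSWORD" "$OPENSHIFT_SERVER" $OC_LOGIN_ARGS; then
    echo "ERROR: oc login failed" >&2
    exit 1
fi
```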
    return 0
fi

if ! oc login -u "$OPENSHIFT_USER" -p "$OPENSHIFT_PASSWORD" "$OPENSHIFT_SERVER" --insecure-skip-tls-verify; then
Copilot AI (Oct 8, 2025):

Using --insecure-skip-tls-verify bypasses SSL certificate validation. Consider making this configurable for production environments.
# Wait for python imagestream to be ready
log "INFO" "Waiting for python imagestream to be ready..."
sleep 5
while ! oc get istag python:3.10-slim -n "$NAMESPACE" &> /dev/null; do
    sleep 2
done
log "SUCCESS" "Python imagestream ready"
Copilot AI (Oct 8, 2025):

The script waits for the 'python:3.10-slim' imagestream tag but the Dockerfile uses 'python:3.10-slim' as a base image. OpenShift may not automatically create an imagestream for external images, potentially causing an infinite loop.
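A hedged fix for the potential infinite loop: bound the wait with a timeout. The timeout value and the error handling are assumptions; the istag check itself is unchanged from the script:

```bash
log "INFO" "Waiting for python imagestream to be ready..."
timeout=120   # seconds (assumed value)
elapsed=0
until oc get istag python:3.10-slim -n "$NAMESPACE" &> /dev/null; do
    if (( elapsed >= timeout )); then
        echo "ERROR: timed out waiting for the python:3.10-slim imagestream tag" >&2
        exit 1
    fi
    sleep 2
    elapsed=$((elapsed + 2))
done
log "SUCCESS" "Python imagestream ready"
```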
# Test 4: Check GPU detection in model-a
log "INFO" "Test 4: Checking GPU detection in model-a container..."
if [[ -n "$POD_NAME" ]]; then
    GPU_CHECK=$(oc exec -n "$NAMESPACE" "$POD_NAME" -c model-a -- python3 -c "import torch; print('CUDA' if torch.cuda.is_available() else 'CPU')" 2>/dev/null || echo "ERROR")
Copilot AI (Oct 8, 2025):

The GPU detection logic is duplicated for model-a and model-b containers. Consider extracting this into a reusable function to reduce code duplication.
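A hedged sketch of the suggested extraction; the helper name is hypothetical and simply reuses the exact probe command from the script:

```bash
# Runs the same torch CUDA probe in a given container of the pod.
check_gpu() {
    local container="$1"
    oc exec -n "$NAMESPACE" "$POD_NAME" -c "$container" -- \
        python3 -c "import torch; print('CUDA' if torch.cuda.is_available() else 'CPU')" \
        2>/dev/null || echo "ERROR"
}

GPU_CHECK_A=$(check_gpu model-a)
GPU_CHECK_B=$(check_gpu model-b)
```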
jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json"

# vLLM Endpoints Configuration
# IMPORTANT: Using localhost since containers are in same pod
Copilot AI (Oct 8, 2025):

[nitpick] While the comments explain the localhost usage, consider adding more context about why this differs from the standard Kubernetes configuration and the pod networking implications. Suggested comment:

# IMPORTANT: Using localhost (127.0.0.1) here because all vLLM containers and the orchestrator are running within the same Kubernetes pod.
# In standard Kubernetes configurations, containers in different pods cannot communicate via localhost; they must use service names or pod IPs over the pod network.
# If you move any of these containers to separate pods, you must update the address fields below to use the appropriate service name or pod IP instead of 127.0.0.1.
# IMPORTANT: Using localhost since containers are in same pod
vllm_endpoints:
  - name: "model-a-endpoint"
    address: "127.0.0.1"  # localhost in same pod
    port: 8000
    models:
      - "Model-A"
    weight: 1
  - name: "model-b-endpoint"
    address: "127.0.0.1"  # localhost in same pod
Copilot AI (Oct 8, 2025):

[nitpick] While the comments explain the localhost usage, consider adding more context about why this differs from the standard Kubernetes configuration and the pod networking implications. Suggested change:

# IMPORTANT: Using localhost (127.0.0.1) here because all containers are running in the same pod and thus share the same network namespace.
# This differs from standard Kubernetes practice, where services typically communicate over the pod network using service names or pod IPs.
# If these containers are ever split into separate pods, using localhost will break communication; in that case, update the address to use the appropriate service name or pod IP.
vllm_endpoints:
  - name: "model-a-endpoint"
    address: "127.0.0.1"  # localhost; only works because containers share the pod's network namespace
    port: 8000
    models:
      - "Model-A"
    weight: 1
  - name: "model-b-endpoint"
    address: "127.0.0.1"  # localhost; only works because containers share the pod's network namespace
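To see the shared network namespace in practice, a hedged check from inside the pod; the label selector, namespace, and the presence of curl in the semantic-router container are assumptions, and /v1/models follows the OpenAI-compatible API the specialists are described as exposing:

```bash
# Assumed label selector "app=semantic-router" and the namespace used elsewhere in this PR.
POD=$(oc get pods -n vllm-semantic-router-system -l app=semantic-router \
      -o jsonpath='{.items[0].metadata.name}')

# Both specialists answer on localhost because all containers share the pod's network namespace.
oc exec -n vllm-semantic-router-system "$POD" -c semantic-router -- curl -s http://127.0.0.1:8000/v1/models   # model-a
oc exec -n vllm-semantic-router-system "$POD" -c semantic-router -- curl -s http://127.0.0.1:8001/v1/models   # model-b
```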
cpu: "2" | ||
# Real LLM specialist containers using llm-katan | ||
- name: model-a | ||
image: image-registry.openshift-image-registry.svc:5000/vllm-semantic-router-system/llm-katan:latest |
Copilot AI (Oct 8, 2025):

The image reference is hardcoded with the namespace. This creates a dependency between the image registry path and the deployment namespace, making it less portable. Suggested change:

  image: image-registry.openshift-image-registry.svc:5000/$(NAMESPACE)/llm-katan:latest
cpu: "1" | ||
nvidia.com/gpu: "1" | ||
- name: model-b | ||
image: image-registry.openshift-image-registry.svc:5000/vllm-semantic-router-system/llm-katan:latest |
Copilot AI (Oct 8, 2025):

The image reference is hardcoded with the namespace. This creates a dependency between the image registry path and the deployment namespace, making it less portable. Suggested change:

  image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/llm-katan:latest
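A hedged way to decouple the image path from a fixed namespace at deploy time; the sed-based substitution is only illustrative, and the template or kustomize variants in this PR could serve the same purpose:

```bash
# Rewrite the internal-registry image path to the target namespace before applying.
NAMESPACE="${NAMESPACE:-vllm-semantic-router-system}"
sed "s|openshift-image-registry.svc:5000/[^/]*/llm-katan:latest|openshift-image-registry.svc:5000/${NAMESPACE}/llm-katan:latest|g" \
    deploy/openshift/deployment.yaml | oc apply -n "$NAMESPACE" -f -
```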
Yes, it's GPU specific.

> Huamin Chen, on deploy/openshift/Dockerfile.llm-katan: do you need this here? there is already a copy here https://github.com/vllm-project/semantic-router/blob/main/e2e-tests/llm-katan/Dockerfile
@yossiovadia I can help you set up an o11y stack (Grafana + Prometheus) in OpenShift when you make it work.
@JaredforReal - really appreciate it! I have already started (finished, actually) working on it.
Overview
Adds complete OpenShift deployment infrastructure with NVIDIA GPU support for running semantic router with specialized LLM containers. Provides zero-touch deployment automation with comprehensive validation.
Architecture
Pod: semantic-router (4 containers)
├── semantic-router - Main ExtProc service (port 50051, 8080)
├── model-a - Math specialist (port 8000, GPU 0)
├── model-b - Coding specialist (port 8001, GPU 1)
└── envoy-proxy - HTTP gateway (port 8801)
Quick Start
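A minimal sequence using the scripts added in this PR; exact flags and prompts may differ, see deploy/openshift/README.md:

```bash
cd deploy/openshift
./deploy-to-openshift.sh      # auto-detects the cluster, builds llm-katan, deploys the 4-container pod
./validate-deployment.sh      # runs the 12-test validation suite
# ./cleanup-openshift.sh      # tear everything down when finished
```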
OpenWebUI Integration
The deployment provides an OpenWebUI-compatible endpoint:
http://envoy-http-./v1
Configure this URL in OpenWebUI settings to use the semantic router as your LLM backend with automatic category-based routing and security guardrails.
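A hedged smoke test against the same endpoint from the command line; the route name `envoy-http` is inferred from the URL prefix above, and the namespace and model name follow the configuration shown earlier in this PR:

```bash
ENVOY_HOST=$(oc get route envoy-http -n vllm-semantic-router-system -o jsonpath='{.spec.host}')

# List the aggregated models, then send an OpenAI-style chat completion.
curl "http://${ENVOY_HOST}/v1/models"
curl -X POST "http://${ENVOY_HOST}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "Model-A", "messages": [{"role": "user", "content": "What is 7 * 6?"}]}'
```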